
Triton v3.6.x iluvatar backend and 5 TLE primitives support#724

Open
Salamanca001 wants to merge 3 commits into
flagos-ai:triton_v3.6.xfrom
Salamanca001:triton_v3.6.x_iluvatar

Conversation

@Salamanca001

Contributor

This PR brings the Iluvatar backend support onto Triton 3.6 in FlagTree and adds Iluvatar TLE lowering support.
Included commits:

  • 7b4cac885 [BACKEND] update iluvatar backend support on triton3.6.
  • 13abd15d8 [TLE][ILUVATAR] Add TLE support for alloc, local_ptr, copy, extract_tile and insert_tile.

Main changes:

  • Add third_party/iluvatar backend integration, including compiler/driver entry points, Iluvatar GPU dialect, lowering passes, target info, utility code, build wiring, and test runner.
  • Add Iluvatar TLE dialect/lowering support for alloc, local_ptr, copy, extract_tile, and insert_tile.
  • Add local pointer optimization passes, barrier insertion, TLE-to-LLVM lowering utilities, and registration into the Iluvatar build/dialect pipeline.
  • Wire Iluvatar setup/build support into top-level CMake and Python setup helpers.

@CLAassistant

CLAassistant commented Jun 26, 2026


CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

Comment thread python/setup_tools/setup_helper.py Outdated
Comment thread CMakeLists.txt Outdated

@sunnycase sunnycase left a comment

Collaborator

Thanks for the work here. Before this is finalized, could you please add a concise summary of the TLE primitive implementation plan?

It would be helpful to cover the main design points, such as the abstraction/lowering flow, compiler/runtime integration points, supported operator scope, dtype/shape/backend limitations, and the validation approach. Could you also include performance data for a few representative operators, ideally with baseline vs. TLE primitive numbers, test shapes, hardware/backend configuration, and measurement methodology?

For the expected level of detail and presentation style, PR #617 could be a useful reference: #617

@Salamanca001 Salamanca001 changed the title from "Triton v3.6.x iluvatar" to "Triton v3.6.x iluvatar backend and 5 TLE primitives support" Jun 29, 2026
@Salamanca001 Salamanca001 force-pushed the triton_v3.6.x_iluvatar branch from 13abd15 to c39b53d June 30, 2026 03:51
@Salamanca001

Contributor Author

[TLE][ILUVATAR] Support TLE Structure on iluvatar backend

This patch adds TLE (Triton Language Extension) structure support to the iluvatar
backend, covering five primitives:

  • tle.gpu.alloc
  • tle.gpu.local_ptr
  • tle.gpu.copy
  • tle.extract_tile
  • tle.insert_tile

1. Overview

The iluvatar backend reuses the shared TLE Python frontend under
python/triton/experimental/tle/ and plugs in backend-specific MLIR builder
bindings and lowering through a dedicated MLIR dialect iluvatar_tle.

  • Python API: shared (no third_party/iluvatar/python/.../tle override).
  • Builder overrides: third_party/iluvatar/tle/triton_iluvatar_tle.cc.
  • MLIR dialect: iluvatar_tle (third_party/iluvatar/tle/include/IR/IluvatarTleOps.td).
  • Optimization passes: 3 local-pointer passes registered in the TTGIR pipeline.
  • LLVM lowering: third_party/iluvatar/tle/lib/Conversion/TleToLLVM/.

2. Supported primitives

2.1 tle.gpu.alloc

  • MLIR: lowers directly to ttg.local_alloc (mutable memdesc in shared memory),
    via create_local_alloc in triton_iluvatar_tle.cc.
  • Coverage: shared-memory buffers with swizzled shared layout
    (make_swizzled_shared_encoding_attr); optional initialized allocation
    (local_alloc(value)).
  • Limitations (enforced in triton_iluvatar_tle.cc):
    • tmem / tensor_memory storage is rejected:
      "iluvatar TLE alloc does not support tmem storage".
    • nv_mma_shared_layout=True is rejected:
      "iluvatar TLE alloc does not support nv_mma_shared_layout=True".
    • Any non-smem storage string is rejected:
      "iluvatar TLE alloc only supports smem storage".
    • In practice, the tests pass nv_mma_shared_layout=False, which routes through
      the default swizzled-shared-layout builder.
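
As an illustration of the three rejection rules above, here is a minimal Python model. The real checks live in C++ in triton_iluvatar_tle.cc; the function name `check_alloc_args`, the order of the checks, and the return value are assumptions made for this sketch. Only the error strings are taken from the description.

```python
def check_alloc_args(storage: str, nv_mma_shared_layout: bool) -> str:
    """Hypothetical Python model of the alloc argument checks.

    The check order and the return value are assumptions; the error
    messages are the ones quoted in the PR description.
    """
    if storage in ("tmem", "tensor_memory"):
        raise ValueError("iluvatar TLE alloc does not support tmem storage")
    if storage != "smem":
        raise ValueError("iluvatar TLE alloc only supports smem storage")
    if nv_mma_shared_layout:
        raise ValueError(
            "iluvatar TLE alloc does not support nv_mma_shared_layout=True")
    # Accepted case: shared memory with the swizzled shared layout.
    return "swizzled_shared"


print(check_alloc_args("smem", False))
for args in [("tmem", False), ("gmem", False), ("smem", True)]:
    try:
        check_alloc_args(*args)
    except ValueError as err:
        print(args, "->", err)
```

The only accepted combination in this model is smem storage with nv_mma_shared_layout=False, matching what the tests exercise.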

2.2 tle.gpu.local_ptr

  • MLIR: lowers to iluvatar_tle.local_pointers (create_local_pointers).
  • Coverage:
    • Full-view pointer when indices=None (result shape equals buffer shape).
    • Indexed pointers with scalar or block (tensor) indices.
    • 1D and 2D buffers; load / store; masked tails; loops; dot operands; runtime round-trips.
  • Limitations (Python checks in core.py): integer-typed indices only; index
    rank must match buffer rank; scalar and tensor indices cannot be mixed; only
    shared-memory buffers are supported.
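
The index rules above can be sketched as a small pure-Python validator. This is a model for illustration, not the code in core.py: `IndexOperand`, `classify_local_ptr`, the dtype-as-string convention, and the error messages are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class IndexOperand:
    """Hypothetical stand-in for one index passed to local_ptr."""
    dtype: str               # e.g. "int32" or "float32"
    shape: Tuple[int, ...]   # () for a scalar, non-empty for a block (tensor)


def classify_local_ptr(buffer_shape: Sequence[int],
                       indices: Optional[Sequence[IndexOperand]] = None,
                       storage: str = "smem") -> str:
    """Apply the documented restrictions and report which case was hit."""
    if storage != "smem":
        raise ValueError("only shared-memory buffers are supported")
    if indices is None:
        # Full view: the result shape equals the buffer shape.
        return "full_view"
    if len(indices) != len(buffer_shape):
        raise ValueError("index rank must match buffer rank")
    if not all(idx.dtype.startswith(("int", "uint")) for idx in indices):
        raise TypeError("indices must be integer-typed")
    scalars = [idx.shape == () for idx in indices]
    if any(scalars) and not all(scalars):
        raise ValueError("scalar and tensor indices cannot be mixed")
    return "scalar" if all(scalars) else "block"


row = IndexOperand("int32", (16, 1))
col = IndexOperand("int32", (1, 16))
print(classify_local_ptr((16, 16)))               # full_view
print(classify_local_ptr((16, 16), (row, col)))   # block
```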

2.3 tle.gpu.copy

  • Normal copy (tensor <-> buffered_tensor): lowered via the load/store +
    local-pointer path (normcopy):
    • Global memory (GM) -> local: tl.load(src) -> local_ptr(dst) -> tl.store.
    • Local -> GM: local_ptr(src) -> tl.load -> tl.store(dst).
    • There is no dedicated TLE MLIR copy op on iluvatar.
  • Descriptor / TMA copy: not supported on iluvatar.
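
To make the normcopy decomposition concrete, the sketch below models both directions with plain Python lists standing in for global and shared memory. The helpers `load`, `store`, and `local_ptr` are simplified, hypothetical stand-ins for tl.load, tl.store, and tle.gpu.local_ptr (full-view case only); they are not the real APIs.

```python
from typing import List, Optional, Sequence


def load(memory: Sequence[float], offsets: Sequence[int],
         mask: Optional[Sequence[bool]] = None, other: float = 0.0) -> List[float]:
    """Gather memory[offsets]; masked-off lanes yield `other` and are not read."""
    mask = [True] * len(offsets) if mask is None else mask
    return [memory[o] if m else other for o, m in zip(offsets, mask)]


def store(memory: List[float], offsets: Sequence[int], values: Sequence[float],
          mask: Optional[Sequence[bool]] = None) -> None:
    """Scatter values to memory[offsets]; masked-off lanes are skipped."""
    mask = [True] * len(offsets) if mask is None else mask
    for o, v, m in zip(offsets, values, mask):
        if m:
            memory[o] = v


def local_ptr(buffer: Sequence[float]) -> List[int]:
    """Full-view pointers: one offset per element of the local buffer."""
    return list(range(len(buffer)))


def copy_gm_to_local(gm, gm_offsets, smem, mask=None):
    regs = load(gm, gm_offsets, mask)       # tl.load(src)
    store(smem, local_ptr(smem), regs, mask)  # local_ptr(dst) -> tl.store


def copy_local_to_gm(smem, gm, gm_offsets, mask=None):
    regs = load(smem, local_ptr(smem), mask)  # local_ptr(src) -> tl.load
    store(gm, gm_offsets, regs, mask)         # tl.store(dst)


gm_in = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
offsets = [4, 5, 6, 7]                      # last tile overruns the array
mask = [o < len(gm_in) for o in offsets]    # masked tail
smem = [0.0] * 4
copy_gm_to_local(gm_in, offsets, smem, mask)
gm_out = [0.0] * 6
copy_local_to_gm(smem, gm_out, offsets, mask)
print(smem, gm_out)
```

The same mask guards both the load and the store, which is how the masked-tail case stays within bounds in this model.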

2.4 tle.extract_tile

  • Coverage: static multi-dim index, static scalar index, dynamic scalar index
    (tl.tensor), and dynamic multi-dim index (auto-linearized).
  • LLVM lowering (tle/lib/Conversion/TleToLLVM/ExtractTileToLLVM.cpp):
    • Tensors must be ranked, must carry an encoding, and the encoding must be
      BlockedEncodingAttr (other encodings rejected with explicit errors).
    • Static + CTA-tile-aligned indices -> register-shuffle path.
    • Dynamic or misaligned indices -> shared-memory relay path.
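
The index handling and path selection above can be modeled as follows. This is a sketch under stated assumptions: the row-major linearization order, the reading of "CTA-tile-aligned" as "every element offset is a multiple of the CTA tile extent in that dimension", and the function names are guesses for illustration, not taken from ExtractTileToLLVM.cpp.

```python
from typing import Optional, Sequence


def linearize_tile_index(index: Sequence[int], grid: Sequence[int]) -> int:
    """Fold a multi-dim tile index into one scalar (row-major order assumed)."""
    if len(index) != len(grid):
        raise ValueError("index rank must match tile-grid rank")
    linear = 0
    for i, extent in zip(index, grid):
        if not 0 <= i < extent:
            raise IndexError(f"tile index {i} out of range [0, {extent})")
        linear = linear * extent + i
    return linear


def choose_extract_path(element_offsets: Optional[Sequence[int]],
                        cta_tile: Sequence[int]) -> str:
    """Pick a lowering path; `None` offsets model a dynamic (runtime) index."""
    if element_offsets is None:
        return "shared_memory_relay"
    aligned = all(off % ext == 0 for off, ext in zip(element_offsets, cta_tile))
    return "register_shuffle" if aligned else "shared_memory_relay"


print(linearize_tile_index((1, 2), grid=(2, 4)))        # 1 * 4 + 2 = 6
print(choose_extract_path((32, 0), cta_tile=(16, 64)))  # static and aligned
print(choose_extract_path((8, 0), cta_tile=(16, 64)))   # static, misaligned
print(choose_extract_path(None, cta_tile=(16, 64)))     # dynamic
```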

2.5 tle.insert_tile

  • MLIR: iluvatar_tle.insert_tile (create_insert_tile); the tile shape is
    inferred from the tile operand (no separate tile_shape attribute), and the
    op declares InferTypeOpInterface.
  • Coverage: static multi-dim index, static scalar index, dynamic scalar index.
  • LLVM lowering (InsertTileToLLVM.cpp): same BlockedEncodingAttr-only
    constraint; static CTA-aligned register-merge path vs. dynamic shared-memory path.
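
A reference model of the intended result may help when reading the tests. The sketch below works on 2D nested lists and infers the tile shape from the tile operand, as the op does. The interpretation of the index as a position in the tile grid, the row-major meaning of a scalar index, and the function name `insert_tile_ref` are assumptions.

```python
from typing import List, Sequence, Union

Matrix = List[List[float]]


def insert_tile_ref(dst: Matrix, tile: Matrix,
                    index: Union[int, Sequence[int]]) -> Matrix:
    """Return a copy of `dst` with `tile` written at tile-grid position `index`."""
    tile_rows, tile_cols = len(tile), len(tile[0])   # inferred from the tile
    rows, cols = len(dst), len(dst[0])
    if rows % tile_rows or cols % tile_cols:
        raise ValueError("destination shape must be a multiple of the tile shape")
    grid_rows, grid_cols = rows // tile_rows, cols // tile_cols
    if isinstance(index, int):
        # Scalar index: assumed to enumerate tiles in row-major order.
        index = divmod(index, grid_cols)
    tr, tc = index
    if not (0 <= tr < grid_rows and 0 <= tc < grid_cols):
        raise IndexError("tile index out of range")
    out = [row[:] for row in dst]   # the result keeps the destination's shape
    for r in range(tile_rows):
        start = tc * tile_cols
        out[tr * tile_rows + r][start:start + tile_cols] = tile[r]
    return out


dst = [[0.0] * 4 for _ in range(4)]
tile = [[1.0, 2.0], [3.0, 4.0]]
for row in insert_tile_ref(dst, tile, 3):   # scalar 3 -> tile (1, 1)
    print(row)
```

Returning a new value rather than mutating `dst` mirrors the value semantics of a tensor op whose result type is inferred from its operands.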

3. Lowering path

Key differences from the native Triton path:

Python TLE API
  alloc        -> ttg.local_alloc                                  (direct)
  local_ptr    -> iluvatar_tle.local_pointers
  copy (norm)  -> tt.load + iluvatar_tle.local_pointers + tt.store
  extract_tile -> iluvatar_tle.extract_tile
  insert_tile  -> iluvatar_tle.insert_tile

TTGIR passes (iluvatar-specific):
  triton-iluvatar-tle-insert-local-pointer-barriers
  triton-iluvatar-tle-optimize-local-pointer-loads
  triton-iluvatar-tle-optimize-local-pointer-stores

make_llir:
  populateTleToLLVMPatterns  (extract_tile / insert_tile / local_pointers)
  -> register shuffle / merge, or shared-memory relay -> LLVM

4. Tests (run in iluvatar CI when FLAGTREE_ILUVATAR_TLE is enabled)

From third_party/iluvatar/test_triton.sh:

Support TLE Tests
python/test/tle/integration/test_tle_local_store.py
python/test/tle/unit/test_tle_gpu_local_ptr.py
python/test/tle/unit/test_extract_tile_static_index.py
python/test/tle/unit/test_extract_tile_dynamic_index.py
python/test/tle/unit/test_insert_tile_static_index.py
python/test/tle/unit/test_insert_tile_dynamic_index.py

5. Performance data

5.1 Measurement methodology

  • Benchmark sources (backend-agnostic tutorials):
    • python/tutorials/tle/01-fft.py

5.2 Environment

| Field        | Value          |
|--------------|----------------|
| Hardware     | Iluvatar Corex |
| Driver / SDK | 4.5.0          |
| Torch        | 2.10.0         |
| FlagTree     | triton_v3.6.x  |

5.3 Representative results

| N    | Triton (ms) | TLE (ms) | Torch (ms) |
|------|-------------|----------|------------|
| 64   | 0.045962    | 0.116885 | 0.022308   |
| 128  | 0.064135    | 0.129558 | 0.036635   |
| 256  | 0.135442    | 0.187827 | 0.056135   |
| 512  | 0.427827    | 1.050106 | 0.122798   |
| 1024 | 1.268423    | 3.248211 | 0.221808   |

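Speedup is computed as baseline_time / TLE_time, so values below 1.0x mean the TLE kernel is slower than the baseline:

| Comparison            | Mean speedup |
|-----------------------|--------------|
| TLE FFT vs Triton FFT | 0.48x        |
| TLE FFT vs Torch FFT  | 0.19x        |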
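
The mean speedups can be recomputed from the per-size timings in 5.3. The sketch below assumes "Mean" is the arithmetic mean of the per-size ratios; with that reading, the result matches the reported figures to two decimal places. The names `ROWS` and `mean_speedup` are illustrative only.

```python
from statistics import mean

# (N, Triton ms, TLE ms, Torch ms), copied from the results in 5.3.
ROWS = [
    (64, 0.045962, 0.116885, 0.022308),
    (128, 0.064135, 0.129558, 0.036635),
    (256, 0.135442, 0.187827, 0.056135),
    (512, 0.427827, 1.050106, 0.122798),
    (1024, 1.268423, 3.248211, 0.221808),
]
TRITON, TLE, TORCH = 1, 2, 3   # column positions in ROWS


def mean_speedup(baseline_col: int) -> float:
    """Arithmetic mean over sizes of baseline_time / TLE_time."""
    return mean(row[baseline_col] / row[TLE] for row in ROWS)


print(f"TLE FFT vs Triton FFT: {mean_speedup(TRITON):.2f}x")
print(f"TLE FFT vs Torch FFT:  {mean_speedup(TORCH):.2f}x")
```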

6. Status note

This patch delivers functional support for the five TLE primitives on the
iluvatar backend (correctness validated by unit/integration tests and CI). As the
FFT benchmark above shows, the TLE path is not yet competitive: it averages 0.48x
the speed of the native Triton kernel and 0.19x that of Torch. Performance
optimization is planned for follow-up commits.

@Salamanca001 Salamanca001 force-pushed the triton_v3.6.x_iluvatar branch 2 times, most recently from 7ed5e70 to cc55e41 June 30, 2026 08:04
@Salamanca001 Salamanca001 force-pushed the triton_v3.6.x_iluvatar branch from cc55e41 to c257aa5 June 30, 2026 09:38